Combining Part of Speech Induction and Morphological Induction
Abstract
Linguistic information is useful in natural language processing, information retrieval and a multitude of sub-tasks involving language analysis. Two types of linguistic information found in all languages are part of speech and morphology. Part of speech information reflects syntactic structure and can assist in tasks such as speech recognition, machine translation and word sense disambiguation. Morphological information describes the structure of words and has applications in automated spelling correction, natural language generation and information retrieval for morphologically complex languages. Machine learning methods in natural language processing acquire linguistic information from corpora of natural language text. While supervised learning algorithms are trained on texts that have been annotated with linguistic features, induction algorithms learn linguistic information from unannotated corpora. Such algorithms avoid any requirement for linguistically annotated training data, a resource that is highly time-intensive to produce. However, in learning from unannotated corpora, only limited sources of information are available. In practice, part of speech induction methods usually learn from distributional evidence about the contexts in which words occur. In contrast, morphological induction methods tend to be based on the orthographic structure of the words in the corpus. However, a word's morphological form and syntactic function often correlate: a word's morphology may indicate its syntactic function and vice versa. Thus, both distributional and orthographic evidence may be useful for both tasks. This thesis investigates the extent to which the information induced by one learner can be used to bootstrap the other: specifically, whether the incorporation of explicit annotations from one learner can improve the performance of the other.
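To make the two kinds of evidence named in the abstract concrete, the following minimal Python sketch extracts both from a toy corpus. It is an illustration only and not the method used in the thesis: the function names, the choice of immediate left and right neighbours as "context", and the use of final character n-grams as a crude stand-in for suffixes are all assumptions made for this example.

```python
from collections import Counter, defaultdict


def distributional_features(sentences):
    """Count the immediate left and right neighbours of every word type.

    Assumption: 'context' means the adjacent token on each side, with
    <s> and </s> marking sentence boundaries.
    """
    contexts = defaultdict(Counter)
    for sentence in sentences:
        padded = ["<s>"] + sentence + ["</s>"]
        for i in range(1, len(padded) - 1):
            word = padded[i]
            contexts[word]["L=" + padded[i - 1]] += 1
            contexts[word]["R=" + padded[i + 1]] += 1
    return contexts


def orthographic_features(word, max_len=3):
    """Return the word's final character n-grams (n = 1..max_len).

    Assumption: word endings approximate suffixes; an n-gram is kept only
    if it is strictly shorter than the word itself.
    """
    return ["suf=" + word[-n:] for n in range(1, max_len + 1) if len(word) > n]


# Hypothetical toy corpus, already tokenised.
corpus = [
    ["the", "dog", "walked", "home"],
    ["the", "cat", "jumped", "home"],
]

contexts = distributional_features(corpus)
walked_suffixes = orthographic_features("walked")
jumped_suffixes = orthographic_features("jumped")
```

In this toy corpus, "walked" and "jumped" share a right-hand context ("home") and also share the ending "ed", which is the kind of correlation between syntactic function and word form that the abstract describes. A real induction system would feed such features to a clustering or segmentation model rather than inspect them directly.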
Similar resources
Combining Distributional and Morphological Information for Part of Speech Induction
In this paper we discuss algorithms for clustering words into classes from unlabelled text using unsupervised algorithms, based on distributional and morphological information. We show how the use of morphological information can improve the performance on rare words, and that this is robust across a wide range of languages.
Using Morphological and Distributional Cues for Inductive Part-of-Speech Tagging
In this paper we evaluate the role of morphological and distributional cues in PoS induction, using an incremental and unsupervised learning algorithm with clustering on a vector space.
Unsupervised Part-of-Speech Acquisition for Resource-Scarce Languages
This paper proposes a new bootstrapping approach to unsupervised part-of-speech induction. In comparison to previous bootstrapping algorithms developed for this problem, our approach aims to improve the quality of the seed clusters by employing seed words that are both distributionally and morphologically reliable. In particular, we present a novel method for combining morphological and distrib...
Design and Implementation of an Intelligent Part of Speech Generator
The aim of this paper is to report on an attempt to design and implement an intelligent system capable of generating the correct part of speech for a given sentence while the sentence is totally new to the system and not stored in any database available to the system. It follows the same steps a normal individual does to provide the correct parts of speech using a natural language processor. It...